{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hackathon #2\n",
    "\n",
    "Written by Eleanor Quint\n",
    "\n",
    "Topics: \n",
    "- Dense layers\n",
    "- Training by minibatch/gradient step and epoch\n",
    "- Splitting the dataset into train/validation\n",
    "\n",
    "This is all set up in an IPython notebook so you can run any code you want to experiment with. Feel free to edit any cell, or add new ones to run your own code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We'll start with our library imports...\n",
    "from __future__ import print_function\n",
    "\n",
    "import numpy as np                 # to use numpy arrays\n",
    "import tensorflow as tf            # to specify and run computation graphs\n",
    "import tensorflow_datasets as tfds # to load training data\n",
    "import matplotlib.pyplot as plt    # to visualize data and draw plots\n",
    "from tqdm import tqdm              # to track progress of loops"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A First Attempt at Classifying MNIST\n",
    "\n",
    "MNIST is a dataset of greyscale 28x28 handwritten digits labelled 0 through 9. We'll use it for a 10-class problem to learn the basics of classification.\n",
    "\n",
    "Let's have a look at the data first. We'll load the data from [TensorFlow Datasets](https://www.tensorflow.org/datasets) and visualize it with matplotlib's `plt.imshow`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds = tfds.load('mnist', shuffle_files=True) # this loads a dict with the datasets\n",
    "\n",
    "# We can create an iterator from each dataset\n",
    "# This one iterates through the train data, shuffling and minibatching by 32\n",
    "train_ds = ds['train'].shuffle(1024).batch(32)\n",
    "\n",
    "# Looping through the iterator, each batch is a dict\n",
    "for batch in train_ds:\n",
    "    # The first dimension in the shape is the batch dimension\n",
    "    # The second and third dimensions are height and width\n",
    "    # Being greyscale means that the image has one channel, the last dimension in the shape\n",
    "    print(\"data shape:\", batch['image'].shape)\n",
    "    print(\"label:\", batch['label'])\n",
    "    break\n",
    "\n",
    "# visualize some of the data\n",
    "idx = np.random.randint(batch['image'].shape[0])\n",
    "print(\"An image looks like this:\")\n",
    "imgplot = plt.imshow(batch['image'][idx])\n",
    "print(\"It's colored because matplotlib wants to provide more contrast than just greys\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Dense layers\n",
    "\n",
    "The first step to building a simple neural network is to specify layers. The most basic building block is the dense layer (AKA linear layer or fully connected layer), so we'll declare a function that creates the layer. Each dense layer is composed of two variables, the weight matrix `W` and the bias vector `b` as well as a non-linear activation function `a`, to calculate the function `f(x) = a(Wx + b)`.\n",
    "\n",
    "Normally we'll use pre-defined layers, but in this notebook we'll do it ourselves first to better understand what's going on under the hood."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Dense(tf.Module):\n",
    "    def __init__(self, output_size, activation=tf.nn.relu):\n",
    "        \"\"\"\n",
    "        Args:\n",
    "            - output_size: (int) number of neurons\n",
    "            - activation: (function) non-linear function applied to the output\n",
    "        \"\"\"\n",
    "        super().__init__()  # let tf.Module set up its own bookkeeping\n",
    "        self.output_size = output_size\n",
    "        self.activation = activation\n",
    "        self.is_built = False\n",
    "        \n",
    "    def _build(self, x):\n",
    "        data_size = x.shape[-1]\n",
    "        self.W = tf.Variable(tf.random.normal([data_size, self.output_size]), name='weights')\n",
    "        self.b = tf.Variable(tf.random.normal([self.output_size]), name='bias')\n",
    "        self.is_built = True\n",
    "\n",
    "    def __call__(self, x):\n",
    "        if not self.is_built:\n",
    "            self._build(x)\n",
    "        return self.activation(tf.matmul(x, self.W) + self.b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first dimension of the input is the \"batch\" dimension, which allows us to run many examples through the model simultaneously. Because each example is a row of `x`, the code computes `xW + b` rather than `Wx + b`: the matrix `W` has a row for each input dimension and a column for each linear unit of the layer. After adding the bias vector to the result of the matrix multiplication, we apply the non-linear activation.\n",
    "\n",
    "Let's define a simple, two-layer network with this class. We activate the first layer with the rectified linear function [`tf.nn.relu`](https://www.tensorflow.org/api_docs/python/tf/nn/relu), but not the second layer, so that we can interpret its output as the [logits](https://stackoverflow.com/questions/41455101/what-is-the-meaning-of-the-word-logits-in-tensorflow) of a discrete probability distribution. Note that we're going to flatten each image into a vector (784 = 28 x 28) so that we can use it with a linear layer (we encountered `tf.reshape` in the last hackathon). Loss is calculated with [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy), which implies that we're interpreting the output of the neural network as the parameters of a [categorical distribution](https://en.wikipedia.org/wiki/Categorical_distribution).\n",
    "\n",
    "Further, we'll process the data in minibatches with a for loop (using tqdm for a progress bar). In the cell below we only run each minibatch forward and calculate the loss from the output logits and the correct labels, so the network is still untrained. Calculating gradients and applying them with an optimizer comes in the next section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "first_layer = Dense(200)\n",
    "second_layer = Dense(10, activation=tf.identity) # no non-linearity, so the outputs are logits\n",
    "\n",
    "loss_values = []\n",
    "accuracy_values = []\n",
    "# Loop through one epoch of data\n",
    "for batch in tqdm(train_ds):\n",
    "    # run network\n",
    "    x = tf.reshape(tf.cast(batch['image'], tf.float32), [-1, 784]) # -1 means everything not otherwise accounted for\n",
    "    labels = batch['label']\n",
    "    x = first_layer(x)\n",
    "    logits = second_layer(x)\n",
    "    \n",
    "    # calculate loss\n",
    "    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels)\n",
    "    loss_values.append(loss)\n",
    "    \n",
    "    # calculate accuracy\n",
    "    predictions = tf.argmax(logits, axis=1)\n",
    "    accuracy = tf.reduce_mean(tf.cast(tf.equal(predictions, labels), tf.float32))\n",
    "    accuracy_values.append(accuracy)\n",
    "\n",
    "# print accuracy\n",
    "print(\"Accuracy:\", np.mean(accuracy_values))\n",
    "# plot per-datum loss\n",
    "loss_values = np.concatenate(loss_values)\n",
    "plt.hist(loss_values, density=True, bins=30)\n",
    "plt.show()"
   ]
  },
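  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the shapes in the dense layer and the per-example cross entropy more concrete, here is a minimal NumPy sketch of the same forward pass for a single layer. It is only an illustration, not how TensorFlow implements it: the batch size of 4, the small weight scale, and the label values are arbitrary assumptions chosen for the example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "batch_size, input_size, output_size = 4, 784, 10  # illustrative sizes\n",
    "\n",
    "x = rng.normal(size=(batch_size, input_size))          # one row per example\n",
    "W = rng.normal(size=(input_size, output_size)) * 0.01  # one column per unit\n",
    "b = np.zeros(output_size)\n",
    "\n",
    "# linear layer without activation, so the outputs are logits\n",
    "logits = x @ W + b\n",
    "\n",
    "# softmax cross entropy, one value per example\n",
    "# subtracting the row maximum keeps the exponentials numerically stable\n",
    "shifted = logits - logits.max(axis=1, keepdims=True)\n",
    "log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))\n",
    "labels = np.array([3, 1, 4, 1])  # made-up class indices\n",
    "loss = -log_probs[np.arange(batch_size), labels]\n",
    "\n",
    "print('logits shape:', logits.shape)\n",
    "print('per-example loss shape:', loss.shape)"
   ]
  },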
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Training by minibatch/gradient step and epoch\n",
    "\n",
    "Now let's re-declare the network with pre-defined layers using [`tf.keras.layers.Dense`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense), group the layers using [`tf.keras.Sequential`](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential), and train the parameters with the [`Adam`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) optimizer.\n",
    "\n",
    "Note how [`tf.GradientTape`](https://www.tensorflow.org/guide/autodiff) is used. We run all the computations which we want to backpropagate gradients through in the scope of the tape and then, after the loss is calculated, we can call `tape.gradient` to calculate the gradient of the loss with respect to the model variables. The optimizer then uses those gradients to update the variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# using Sequential groups all the layers to run at once\n",
    "model = tf.keras.Sequential()\n",
    "model.add(tf.keras.layers.Dense(200, tf.nn.relu))\n",
    "model.add(tf.keras.layers.Dense(10))\n",
    "optimizer = tf.keras.optimizers.Adam()\n",
    "\n",
    "loss_values = []\n",
    "accuracy_values = []\n",
    "# Loop through one epoch of data\n",
    "for epoch in range(1):\n",
    "    for batch in tqdm(train_ds):\n",
    "        with tf.GradientTape() as tape:\n",
    "            # run network\n",
    "            x = tf.reshape(tf.cast(batch['image'], tf.float32), [-1, 784])\n",
    "            labels = batch['label']\n",
    "            logits = model(x)\n",
    "\n",
    "            # calculate loss\n",
    "            loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels)    \n",
    "        loss_values.append(loss)\n",
    "    \n",
    "        # gradient update\n",
    "        grads = tape.gradient(loss, model.trainable_variables)\n",
    "        optimizer.apply_gradients(zip(grads, model.trainable_variables))\n",
    "    \n",
    "        # calculate accuracy\n",
    "        predictions = tf.argmax(logits, axis=1)\n",
    "        accuracy = tf.reduce_mean(tf.cast(tf.equal(predictions, labels), tf.float32))\n",
    "        accuracy_values.append(accuracy)\n",
    "\n",
    "model.summary() # prints the layer table itself, so no print() is needed\n",
    "    \n",
    "# accuracy\n",
    "print(\"Accuracy:\", np.mean(accuracy_values))\n",
    "# plot per-datum loss\n",
    "loss_values = np.concatenate(loss_values)\n",
    "plt.hist(loss_values, density=True, bins=30)\n",
    "plt.show()"
   ]
  },
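  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a much smaller illustration of `tf.GradientTape` than the training loop above, the sketch below differentiates two toy expressions with respect to a single variable. The variable `w` and the values are made up for the example. The second part shows that when the target is a vector, as with the per-example loss above, `tape.gradient` differentiates the sum of its elements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "\n",
    "w = tf.Variable(3.0)  # toy variable, stands in for a model parameter\n",
    "\n",
    "# scalar target: y = w^2, so dy/dw = 2w\n",
    "with tf.GradientTape() as tape:\n",
    "    y = w * w\n",
    "print('dy/dw:', tape.gradient(y, w).numpy())\n",
    "\n",
    "# vector target: the gradient is taken of the sum of the elements,\n",
    "# so d(sum(w * x))/dw = sum(x)\n",
    "x = tf.constant([1.0, 2.0, 3.0])\n",
    "with tf.GradientTape() as tape:\n",
    "    per_example = w * x\n",
    "print('d(sum)/dw:', tape.gradient(per_example, w).numpy())"
   ]
  },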
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Splitting the dataset into train/validation\n",
    "\n",
    "After one epoch of training the loss values drop dramatically and accuracy rises from chance (\\~10%) to that of a decent classifier (\\~85-90%). In practice we want to train for many epochs and use the set of parameters which gives the lowest validation error. Unfortunately, we can't tell which set of parameters that is, because we're training on all the data and have nothing held out to measure it on. Before training, we should split the training dataset into a train set, which will be used for parameter updates, and a validation set, which will not. Then, we can determine which parameters generalise best by calculating the accuracy on the hold-out validation set. The parameters with the highest accuracy on validation will likely generalise the best.\n",
    "\n",
    "The easiest way to do this with TensorFlow Datasets is to use its string indexing notation when loading the datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The first 90% of the training data\n",
    "# Use this data for the training loop\n",
    "train = tfds.load('mnist', split='train[:90%]')\n",
    "\n",
    "# And the last 10%, we'll hold out as the validation set\n",
    "# Notice the python-style indexing, but in a string and with percentages\n",
    "# After the training loop, run another loop over this data without the gradient updates to calculate accuracy\n",
    "validation = tfds.load('mnist', split='train[-10%:]')"
   ]
  },
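  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loop over the validation data without gradient updates could look roughly like the sketch below. The helper name `accuracy_on` is hypothetical, and the stand-in dataset and untrained model exist only so the cell runs on its own. With the objects from this notebook you would call something like `accuracy_on(model, validation.batch(32))` instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "\n",
    "def accuracy_on(model, dataset):\n",
    "    # hypothetical helper: fraction of correct predictions over a batched dataset\n",
    "    correct = 0\n",
    "    total = 0\n",
    "    for batch in dataset:\n",
    "        x = tf.reshape(tf.cast(batch['image'], tf.float32), [-1, 784])\n",
    "        logits = model(x)  # forward pass only, no tape and no optimizer step\n",
    "        predictions = tf.argmax(logits, axis=1)\n",
    "        matches = tf.cast(tf.equal(predictions, batch['label']), tf.int32)\n",
    "        correct += int(tf.reduce_sum(matches))\n",
    "        total += int(batch['label'].shape[0])\n",
    "    return correct / total\n",
    "\n",
    "# stand-in data with the same structure as the MNIST batches (all zeros)\n",
    "fake_validation = tf.data.Dataset.from_tensor_slices({\n",
    "    'image': tf.zeros([64, 28, 28, 1], tf.uint8),\n",
    "    'label': tf.zeros([64], tf.int64),\n",
    "}).batch(32)\n",
    "fake_model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # untrained\n",
    "\n",
    "print('validation accuracy:', accuracy_on(fake_model, fake_validation))"
   ]
  },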
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Homework\n",
    "\n",
    "Your homework is to specify a network with `tf.keras.layers`, train it on the MNIST dataset (as above, but with train/validation split), and try out 2 or 3 variations of different architectures. I.e., change the number of neurons or layers, change the activation function (you can find more in the documentation at [`tf.nn`](https://www.tensorflow.org/api_docs/python/tf/nn)), or even change the optimizer ([`tf.keras.optimizers`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers)). Write up a paragraph or two with your observations. E.g., how did it affect the final accuracy on the validation data? How did it affect the rate at which the model improved? Remember to add early stopping and increase the number of training epochs. Submit a `.pdf` with the writeup and `.py` with the code.\n",
    "\n",
    "I'm expecting this to take about an hour (or less if you're experienced). Feel free to use any code from this or previous hackathons. If you don't understand how to do any part of this or if it's taking you longer than that, please let me know in office hours or by email (both can be found on the syllabus). I'm also happy to discuss if you just want to ask more questions about anything in this notebook!\n",
    "\n",
    "### Coda"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import HTML\n",
    "# From Colah's Blog, linearly separating spirals with linear transforms and non-linearities\n",
    "# How does a neural network separate entangled data?\n",
    "print(\"We want the blue and red lines to be linearly separable, so how does a neural network manage to do this?\\\n",
    " Let's visualize the linear transformations and non-linearities.\")\n",
    "HTML('<img src=\"http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/img/spiral.1-2.2-2-2-2-2-2.gif\">')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (myenv)",
   "language": "python",
   "name": "myenv"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
